A Complete OCR System Development of Tamil Magazine Documents

نویسنده

  • K. H. Aparna
چکیده

We present an early version of a complete Optical Character Recognition (OCR) system for Tamil magazine documents. All the standard elements of OCR process like deskewing, preprocessing, segmentation, character recognition and reconstruction are implemented. Experience with OCR problems teaches that for most subtasks involved in OCR, there is no single technique that gives perfect results for every type of document image. We have used the ability of artificial neural networks to learn arbitrary input/output mappings from sample data for solving the key problems of segmentation and character recognition. Text segmentation of Tamil newsprint poses a new challenge owing to its italic-like font type; problems that arise in recognition of touching and close characters are discussed. Character recognition efficiency varied from 94 to 97% for this particular type of font. The grouping of blocks into logical units and the determination of read order within each logical unit helped us in reconstructing automatically the document image in editable format. Final reconstruction is done in HTML format.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Complete Tamil Optical Character Recognition System

The aim of the present work is to recognise printed Tamil text. Though commercial Optical Character Recognition (OCR) packages are available in the market for Roman Script, not much work has been done in the field of OCR for Indian languages. Indian scripts usually have a large number of symbols and hence, recognition is a challenging task. In the current context, a complete OCR in printed Tami...

متن کامل

Language Identification in Multilingual Documents

Most optical character recognition (OCR) systems can recognize at most a few languages. For large archives of document images that contain different languages, there must be some way to automatically categorize these documents before applying the proper OCR on them. This report presents a research in the identification of English, Chinese, Malay and Tamil in image documents. While most other wo...

متن کامل

A Complete Machine printed Gurmukhi OCR System

Recognition of Indian language scripts is a challenging problem. Work for the development of complete OCR systems for Indian language scripts is still in infancy. Complete OCR systems have recently been developed for Devanagri and Bangla scripts. Research in the field of recognition of Gurmukhi script faces major problems mainly related to the unique characteristics of the script like connectiv...

متن کامل

A complete OCR for printed Tamil text

A Neural Network approach is proposed to build an automatic off-line handwritten Tamil character recognition system. We have used a Back Propagation Network (BPN) as a character recognizer. Once trained, the network has a very fast response time. However, the learning phase of this recognizer is a relatively difficult task in this application. The input image of the handwritten character is giv...

متن کامل

On Automatic Similarity Linking in Digital Libraries

Hypertext links are a powerful extension of standard information retrieval techniques based on query languages. However, the generation of links is often impractical due to large manual and/or computational effort. In this paper, we analyze the effects of two main approaches that aim at a restriction of the necessary efforts: The direct use of OCR-processed documents instead of manually post-pr...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003